Conversation
This changeset adds a user guide documenting how to use the Flux Policy in the Kubeflow Trainer. The example runs a popular simulation, LAMMPS, on CPU. A GPU example is desired and will likely be added in a separate changeset. Signed-off-by: vsoch <vsoch@users.noreply.github.com>
|
Hi @vsoch. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. |
|
🚫 This command cannot be processed. Only organization members or owners can use the commands. |
|
@vsoch Can you update this guide for Flux in Trainer please? |
|
Sure thing - I'll bring up a cluster on AWS today and test out using the (now merged) main branch to run it. |
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. |
|
@andreyvelich I've updated the demo after testing on AWS, and it now uses assets from the trainer repository directly. Do you want me to include an example that uses AWS EFA? Apologies, because I think I asked this before, but does the Kubeflow Trainer support one-off resources like EFA? |
|
Found it! Putting here so I remember next time. 🙃

    $ kubectl explain trainjob.spec.trainer.resourcesPerNode
    GROUP:      trainer.kubeflow.org
    KIND:       TrainJob
    VERSION:    v1alpha1

    FIELD: resourcesPerNode <Object>

    DESCRIPTION:
        resourcesPerNode defines the compute resources for each training node.

    FIELDS:
      claims <[]Object>
        Claims lists the names of resources, defined in spec.resourceClaims,
        that are used by this container.
        This field depends on the
        DynamicResourceAllocation feature gate.
        This field is immutable. It can only be set for containers.

      limits <map[string]Object>
        Limits describes the maximum amount of compute resources allowed.
        More info:
        https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

      requests <map[string]Object>
        Requests describes the minimum amount of compute resources required.
        If Requests is omitted for a container, it defaults to Limits if that is
        explicitly specified,
        otherwise to an implementation-defined value. Requests cannot exceed Limits.
        More info:
        https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ |
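For reference, a minimal sketch of how the `resourcesPerNode` field shown above might be used to request an EFA device per node. This is an untested assumption rather than a manifest from this PR: the `metadata.name`, the `runtimeRef` name, and the `numNodes` value are hypothetical placeholders, and `vpc.amazonaws.com/efa` is only schedulable if the AWS EFA device plugin is running on the cluster's nodes.

```yaml
# Sketch only: values marked "hypothetical" are placeholders, not taken from this PR.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: lammps-efa-example          # hypothetical name
spec:
  runtimeRef:
    name: flux-runtime              # hypothetical; use the runtime installed in your cluster
  trainer:
    numNodes: 2                     # hypothetical node count
    resourcesPerNode:
      # Extended resources such as EFA devices take whole-number quantities,
      # and requests must equal limits when both are set.
      requests:
        vpc.amazonaws.com/efa: 1
      limits:
        vpc.amazonaws.com/efa: 1
```

Because `limits` and `requests` follow the standard Kubernetes container resource schema, any extended resource advertised by a device plugin should be expressible the same way.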
|
@andreyvelich FYI, we are going to be showing off Kubeflow Trainer at the High Performance Software Foundation (HPSF) meeting next week! https://events.linuxfoundation.org/hpsf-conference/program/schedule/ The best part of that abstract might be the title :) |
Sure, you can add this into Flux examples: https://github.com/kubeflow/trainer/tree/master/examples/flux or Trainer documentation.
This is awesome! We should definitely promote it through our outreach channels. cc: @kubeflow/kubeflow-outreach-committee @kubeflow/kubeflow-steering-committee @kubeflow/kubeflow-trainer-team. @tarekabouzeid @yashpal2104, could you help share this on Kubeflow’s social channels? Highlighting that Kubeflow Trainer is being used for HPC workloads would be especially impactful and highly relevant for the AI community. |
Description of Changes
This changeset adds a user guide documenting how to use the Flux Policy in the Kubeflow Trainer. The example runs a popular simulation, LAMMPS, on CPU. A GPU example is desired and will likely be added in a separate changeset.
Related Issues
This is linked with a pull request to the trainer:
Related: kubeflow/trainer#3064.
I did not open an issue here (and can if needed, please let me know).
Checklist
cc @milroy